{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<img src=\"../../img/ods_stickers.jpg\">\n",
    "## Открытый курс по машинному обучению\n",
    "<center>Автор материала: Куликов Павел Викторович, @kulikovpavel."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <center>ELI5 - библиотека для визуализации и отладки ML моделей</center>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "f43ebff8d97c806cd841e0ba22645d3afd39a365"
   },
   "source": [
    "Ссылки:\n",
    "\n",
    "[Документация](http://eli5.readthedocs.io/en/latest/) (отличная!)\n",
    "\n",
    "[Github](https://github.com/TeamHG-Memex/eli5/blob/master/docs/source/index.rst)\n",
    "\n",
    "Авторы: Михаил Коробов ([@kmike](https://opendatascience.slack.com/messages/@U064DRUF4)), Константин Лопухин ([@kostia](https://opendatascience.slack.com/team/U0P95857C))\n",
    "\n",
    "[Мотивационное видео](https://www.youtube.com/watch?v=pqqcUzj3R90)\n",
    "\n",
    "Установка\n",
    "\n",
    "```pip install eli5```\n",
    "> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "2643287e8f661e74bad803c16744aac1c2490775"
   },
   "source": [
    "Библиотека из коробки умеет работать с линейными моделями, деревьями и ансамблями (scikit-learn, xgboost, LightGBM, lightning, sklearn-crfsuite) и в красивом виде показывает значимость признаков, может строить деревья, как текст или как картинки. Кроме этого есть важный функционал анализа предсказаний, можно визуально оценить, почему для того или иного примера ваша модель выдала тот или иной результат\n",
    "\n",
    "![](https://raw.githubusercontent.com/TeamHG-Memex/eli5/master/docs/source/static/word-highlight.png)\n",
    "\n",
    "Может работать с пайплайнами, в тот числе с HashingVectorizer и даже с препроцессингом в виде черного ящика, реализация алгоритма [LIME](https://arxiv.org/abs/1602.04938)\n",
    "\n",
    "У библиотеки настолько прекрасная документация и подробные примеры, что просто проанализирую пару датасетов, а за остальным  лучше к ребятам на сайт\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "2dc415580683adda95341c92e549a32af64ba5fc"
   },
   "source": [
    "## XGBClassifier and LogisticRegression, categorial\n",
    "\n",
    "Young People Survey. Explore the preferences, interests, habits, opinions, and fears of young people\n",
    "\n",
    "[Ссылка на датасет](https://www.kaggle.com/miroslavsabo/young-people-survey)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
    "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5"
   },
   "outputs": [],
   "source": [
    "import eli5\n",
    "import numpy as np  # linear algebra\n",
    "import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_cell_guid": "82f01488-9c67-4400-801b-e5a459b6a3ec",
    "_uuid": "e0a0617445dfd0f1920afe2b3c7dc7a78eb73b14"
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv('responses.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "6885b2da2bfb78a5c0bc2befe6c6b9124e622b84"
   },
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "4f3808787e302949866899ec66ac796d4ba5a729"
   },
   "source": [
    "Возьмем в качестве целевой переменной место, где живет человек, деревня или город"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_cell_guid": "289d7028-83ce-4756-87c5-3f1f5496346e",
    "_uuid": "5105719a67111030aa10c2a7f730de0842956b26"
   },
   "outputs": [],
   "source": [
    "df['Village - town'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "5d6c95640560eb9f32f17592d014829a8a4a2d3f"
   },
   "outputs": [],
   "source": [
    "df['Village - town'].fillna('city', inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_cell_guid": "90792a77-1fe5-4a0b-9dd2-d7710794c4db",
    "_uuid": "55f2de35757110684fd9a1bb46d493e9db47cf84"
   },
   "outputs": [],
   "source": [
    "X = df.drop(['Village - town'], axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_cell_guid": "65e19591-5c5d-494c-80b1-d7aba0e93bb5",
    "_uuid": "fc3cf4062f6e4059c34c02be593a7b6b001efaee"
   },
   "outputs": [],
   "source": [
    "target = df['Village - town'].map(dict(city=0, village=1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_cell_guid": "68f5cd71-3644-437b-8595-ddb3be814970",
    "_uuid": "45086a710c91741abdd9e5335ac7ae619fe3cfbd"
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "\n",
    "# xgboost <= 0.6a2 shows a warning when used with scikit-learn 0.18+\n",
    "warnings.filterwarnings('ignore', category=DeprecationWarning)\n",
    "from sklearn.base import BaseEstimator, TransformerMixin\n",
    "from sklearn.feature_extraction import DictVectorizer\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import LabelBinarizer\n",
    "from xgboost import XGBClassifier, XGBRegressor\n",
    "\n",
    "\n",
    "# workaround for xgboost 0.7\n",
    "def _check_booster_args(xgb, is_regression=None):\n",
    "    # type: (Any, bool) -> Tuple[Booster, bool]\n",
    "    if isinstance(xgb, eli5.xgboost.Booster): # patch (from \"xgb, Booster\")\n",
    "        booster = xgb\n",
    "    else:\n",
    "        booster = xgb.get_booster() # patch (from \"xgb.booster()\" where `booster` is now a string)\n",
    "        _is_regression = isinstance(xgb, XGBRegressor)\n",
    "        if is_regression is not None and is_regression != _is_regression:\n",
    "            raise ValueError(\n",
    "                'Inconsistent is_regression={} passed. '\n",
    "                'You don\\'t have to pass it when using scikit-learn API'\n",
    "                .format(is_regression))\n",
    "        is_regression = _is_regression\n",
    "    return booster, is_regression\n",
    "\n",
    "eli5.xgboost._check_booster_args = _check_booster_args"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "bf0d07a7351028defb87e4745d5681a9ac62886e"
   },
   "outputs": [],
   "source": [
    "def prepare_df(data, columns=None):\n",
    "    if not columns:\n",
    "        columns = data.columns.values\n",
    "        \n",
    "    arr_categorial = list()\n",
    "    \n",
    "    for col in columns:\n",
    "        lb = LabelBinarizer()\n",
    "        transformed = lb.fit_transform(data[col].astype('str'))\n",
    "        arr_categorial.append(pd.DataFrame(transformed, columns=col + '__' + lb.classes_.astype('object')).to_sparse())\n",
    "\n",
    "    concated_df = pd.concat([data.drop(columns, axis=1)] + arr_categorial, axis=1).to_sparse()\n",
    "    return concated_df\n",
    "\n",
    "categorical_columns = ['Smoking', 'Alcohol', 'Punctuality', 'Lying', 'Internet usage', 'Gender', 'Left - right handed', 'Education', 'Only child', 'House - block of flats']\n",
    "binarized_x = prepare_df(X, categorical_columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "bedfcb37680eb83a43a76c0e8ca3aa730c2f1bed"
   },
   "outputs": [],
   "source": [
    "xgb = XGBClassifier()\n",
    "\n",
    "def evaluate(_clf, df, target):\n",
    "    scores = cross_val_score(_clf, df, target, scoring='roc_auc', cv=10)\n",
    "    print('Accuracy: {:.3f} ± {:.3f}'.format(np.mean(scores), 2 * np.std(scores)))\n",
    "    _clf.fit(df, target)  # so that parts of the original pipeline are fitted\n",
    "\n",
    "evaluate(xgb, binarized_x, target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "c1920c1a103048129db34e19b64b859b69e8d181"
   },
   "outputs": [],
   "source": [
    "eli5.explain_weights(xgb, top=50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "7e22b44c7da6dab68f27252405450d0fd441ac70"
   },
   "source": [
    "Важность признаков для классификатора. По умолчанию используется  прирост информации, \"gain”, среднее значение по всем деревьям. Есть другие варианты, можно поменять через свойство importance_type.\n",
    "\n",
    "Мы можем взглянуть теперь на конкретный пример"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "e5c8bccb3240157597f2bf4afd3553a37025325f"
   },
   "outputs": [],
   "source": [
    "eli5.show_prediction(xgb, binarized_x.iloc[300], show_feature_values=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "914540b58a0bc130ba933c4d85a21af810df9b91"
   },
   "source": [
    "Получили, что данный участник, вероятно, живет в городе, потому что не живет в квартире, тратит деньги на благотворительность и носит брендовые вещи\n",
    "\n",
    "Посмотрим на логистическую регрессию"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "9193fd78e60546d9d4d93766971d60acf3284d44"
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "lr = LogisticRegression()\n",
    "\n",
    "evaluate(lr, binarized_x.fillna('0'), target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "3469c654f46e01fa8a8f6b14afc75cd75fdd8cbf"
   },
   "outputs": [],
   "source": [
    "eli5.show_weights(lr, feature_names=binarized_x.columns.values, top=100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "bbf19bb2535fae6ceae89e926b4d4b65bb37ea11"
   },
   "outputs": [],
   "source": [
    "eli5.show_prediction(lr, binarized_x.iloc[300].fillna('0'), show_feature_values=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "e01720f15d6494f2027a6debbdbe6bdc4bcbfd02"
   },
   "source": [
    "Сразу заметно, что мы допустили ошибку (не отскалировали величины), и логистическая регрессия напрасно берет вес и рост как сильный значимый фактор, причем вес в плюс, а рост в минус, по сути компенсируя взаимно (факторы скоррелированы). И возраст тоже. Переобучение."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "9044071b9217988bca25d011fb35675d0dbe4a0e"
   },
   "source": [
    "## Анализ текста\n",
    "\n",
    "First GOP Debate Twitter Sentiment. Analyze tweets on the first 2016 GOP Presidential Debate\n",
    "\n",
    "[Ссылка на датасет](https://www.kaggle.com/crowdflower/first-gop-debate-twitter-sentiment)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
    "from sklearn.linear_model import LogisticRegressionCV\n",
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "df = pd.read_csv(\"Sentiment.csv.zip\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vec = CountVectorizer()\n",
    "clf = LogisticRegressionCV()\n",
    "pipe = make_pipeline(vec, clf)\n",
    "pipe.fit(df.text, df.sentiment)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eli5.show_weights(clf, vec=vec, top=20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eli5.show_prediction(clf, df.iloc[140].text, vec=vec)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(3,10), max_features=20000)\n",
    "clf = LogisticRegressionCV()\n",
    "pipe = make_pipeline(vec, clf)\n",
    "pipe.fit(df.text, df.sentiment)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eli5.show_weights(clf, vec=vec, top=20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eli5.show_prediction(clf, df.iloc[140].text, vec=vec)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "При работе с большими объемами часто применятеся HashingVectorizer, для уменьшения размерности признакового пространства. ELI5 поддерживает работу с такими преобразованиями с помощью инвертирования.\n",
    "\n",
    "```\n",
    "from eli5.sklearn import InvertableHashingVectorizer\n",
    "import numpy as np\n",
    "\n",
    "vec = HashingVectorizer(stop_words='english', ngram_range=(1,2))\n",
    "ivec = InvertableHashingVectorizer(vec)\n",
    "sample_size = len(twenty_train.data) // 10\n",
    "X_sample = np.random.choice(twenty_train.data, size=sample_size)\n",
    "ivec.fit(X_sample);\n",
    "```\n",
    "\n",
    "http://eli5.readthedocs.io/en/latest/libraries/sklearn.html#reversing-hashing-trick"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LIME, черный ящик в текстовой обработке\n",
    "\n",
    "Идея заключается в том, чтобы чуть-чуть менять входные строки, убирать случайным образом слова-символы, и смотреть как меняются предсказания модели, таким образом запоминать их влияние на на модель\n",
    "\n",
    "http://eli5.readthedocs.io/en/latest/tutorials/black-box-text-classifiers.html"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "###\n",
    "\n",
    "Павел Куликов\n",
    "\n",
    "kulikovpavel@gmail.com\n",
    "\n",
    "+7 903 118 37 41"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
